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Algorithms for Rewriting Aggregate Queries 

Using Views 



Abstract. Queries involving aggregation are typical in database appli- 
cations. One of the main ideas to optimize the execution of an aggregate 
query is to reuse results of previously answered queries. This leads to the 
problem of rewriting aggregate queries using views. Due to a lack of the- 
ory, algorithms for this problem were rather ad-hoc. They were sound, 
but were not proven to be complete. 

Recently we have given syntactic characterizations for the equivalence of 
aggregate queries and applied them to decide when there exist rewritings. 
However, these decision procedures do not lend themselves immediately 
to an implementation. In this paper, we present practical algorithms 
for rewriting queries with count and sum. Our algorithms are sound. 
They are also complete for important cases. Our techniques can be used 
to improve well-known procedures for rewriting non-aggregate queries. 
These procedures can then be adapted to obtain algorithms for rewriting 
queries with min and max. The algorithms presented are a basis for 
realizing optimizers that rewrite queries using views. 



1 Introduction 



Aggregate queries occur in many applications, such as data warehousing |TS97|, 
mobile computing BI94 |, and global information systems [LR096t]. The size of 
the database in these applications is generally very large. Aggregation is often 
used in queries against such sources as a means of reducing the granularity of 
data. The execution of aggregate queries tends to be time consuming and costly. 
Computing one aggregate value often requires scanning many data items. This 
makes query optimization a necessity. A promising technique to speed up the 
execution of aggregate queries is to reuse the answers to previous queries to 
answer new queries. If the previous queries involved aggregation, the answers to 
them will tend to be much smaller than the size of the database. Thus, using 
their answers will be much more efficient. 

We call a reformulation of a query that uses other queries a rewriting. Finding 
such rewritings is known as the problem of rewriting queries using views. In 
this phrasing of the problem, it is assumed that there is a set of views, whose 
answers have been stored, or materialized. Given a query, the problem is to 
find a rewriting, which is formulated in terms of the views and some database 
relations, such that evaluating the original query yields the same answers as 
evaluating first the views and then the rewriting. 

Rewriting queries using views has been studied for non-aggregate queries 
[LMSS95|, and algorithms have been devised and implemented | LSK9^, Qia9€ | . 
For aggregate queries, the problem has been investigated mainly in the special 



case of datacubes [EIRU96,Dyr96]. However, there is little theory for general 
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aggregate queries, and the rewriting algorithms that appear in the hterature 
are by and large ad hoc. These algorithms arc sound, that is, the reformulated 
queries they produce are in fact rewritings, but there is neither a guarantee that 
they output a rewriting whenever one exists, nor that they generate all existing 
rewritings ]SD JL96|JGHQ95| . 

Recently, syntactic characterizations for the equivalence of SQL queries with 
the aggregate operators min, max, count, and sum have been given [NSS98|. 
They have been applied to decide, given an aggregate query and a set of views, 
whether there exists a rewriting, and whether a new query over views and base 
relations is a rewriting | CNS9E | . 

Using these characterizations, one can "guess" candidates for rewritings and 
verify if they are in fact equivalent to the original query. However, this pro- 
cess is highly nondeterministic. Clearly, it is more efficient to gradually build 
a candidate for rewriting in a way that will ensure its being a rewriting. The 
characterizations do not immediately yield practical algorithms of this sort. In 
fact, there are several subtle problems that must be dealt with in order to yield 
complete algorithms. 

In this paper, we show how to derive practical algorithms for rewriting aggre- 
gate queries. The algorithms are sound, i.e., they output rewritings. We can also 
show that they are complete for important cases, which are relevant in practice. 
In Section 2 we present a motivating example. A formal framework for rewritings 
of aggregate queries is presented in Section 3. In Section 4 we give algorithms 
for rewriting aggregate queries and in Section 5 we conclude. In Appendix ^ 
we demonstrate how queries written in SQL can be translated to our extended 
Datalog syntax and vice versa. 



2 Motivation 

We discuss an example that illustrates the rewriting problem for aggregate 
queries. All the examples in this paper are written using an extended Data- 
log syntax. This syntax is more abstract and concise than SQL. In Section ^ we 
present a formal definition of the Datalog syntax. In Appendix ^ we describe 
how queries written in SQL can be translated to our Datalog syntax and vice 
versa. 

The following example models exactly the payment policy for teaching assis- 
tants at the Hebrew University in Jerusalem. There are two tables with relations 
pertaining to salaries of teaching assistants: 

ta(n£mie , coursejiame , job_type) and 
salaries ( j ob_type , sponsorship , amount) . 

At the Hebrew University, there may be many teaching assistants in a course. 
Each TA has at least one job_type in the course he assists. For example, he 
may give lectures or grade exercises. Teaching assistants are financed by different 
sources, like science foundations and the university itself. For each job type, each 
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sponsor gives a fixed amount. Thus, a lab instructor may receive $600 per month 
from the university and $400 from a government science foundation. 

We suppose that there are two materiahzed views. In the first one of them, 
v_positions_per_type, we compute the number of positions of each type held 
in the university. In the second view, v_salaryJor_ta_job we compute the 
total salary for each type of position. We express aggregate queries with an 
extended Datalog notation, where in the head we separate grouping variables 
and aggregate terms by a semicolon: 

v_positions_per_type(j; count) <— ta(n,c, j) 
v_salaryjfor_ta_j ob(j; sum(a)) ^ salaries(j', s, a). 



In Subsection 3.2 we define a semantics for such Datalog queries that identi- 
fies them with SQL queries where the attributes in the GROUP BY clause and 
those in the SELECT clause coincide. The grouping variables correspond to those 
attributes. 

In the following query we calculate the total amount of money spent on each 
job position: 

q{j] sum{a)) ^ ta{n, c, j) & salaries(j, s, a) 

An intelligent query optimizer could now reason that for each type of job we can 
calculate the total amount of money spent on it if we multiply the salary that 
one TA receives for such a job by the number of positions of that type. The two 
materialized views contain information that can be combined to yield an answer 
to our query. The optimizer can formulate a new query that only accesses the 
views and does not touch the tables in the database: 

r(j';a' * cnt) <— v_positions_per_type(j'; cnt) & v_salary_f or_ta_job(7'; a') 

In order to evaluate the new query, we no longer need to look up all the teaching 
assistants nor all the financing sources. Thus, probably, the new query can be 
executed more efficiently. 

In this example, we used our common sense in two occasions. First, we gave an 
argument why evaluating the original query yields the same result as evaluating 
the new query that uses the views. Second, because we understood the semantics 
of the original query and the views, we were able to come up with a reformulation 
of the query over the views. Thus, we will only be able to build an optimizer 
that can rewrite aggregate queries, if we can provide answers to the following 
two questions. 

— Rewriting Verification: How can we prove that a new query, which uses 
views, produces the same results as the original query? 

— Rewriting Computation: How can we devise an algorithm that system- 
atically and efficiently finds all rewritings? 



If efficiency and completeness cannot be achieved at the same time, we may 
have to find a good trade-off between the two requirements. 
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3 A Formal Framework 



In this section we define the formal framework in which we study rewritings of 
aggregate queries. We extend the well-known Datalog syntax for non-aggregate 
queries [U1189| so that it covers also aggregates. These queries express nonnested 
SQL queries without a HAVING clause and with the aggregate operators min, 
max, count, and sum. A generalization to queries with the constructor UNION 
is possible, but beyond the scope of this paper. For queries with arbitrary nest- 
ing and negation no rewriting algorithms are feasible, since equivalence of such 
queries is undecidable. 



3.1 Non-aggregate Queries 

We recall the Datalog notation for conjunctive queries and extend it to aggregate 
queries. 

A term (denoted as s, t) is either a variable (denoted as x, y, z) or a constant. 
A comparison has the form si p S2, where p is either < or <. If C and C are 
conjunctions of comparisons, we write C |= C" if C is a consequence of C . We 
assume all comparisons range over the rationals. 

We denote predicates as p, q, r. A relational atom has the form p{si, . . . , Sfe). 
Sometimes we write p{s), where s denotes the tuple of terms si, . . . , s^. An atom 
(denoted as a, b) is either a relational atom or a comparison. 

A conjunctive query is an expression of the form q{xi , . . . , Xfe) <— oi & • • • & 
The atom q{xi, . . . , Xfc) is called the head of the query. The atoms ai, . . . , a„ form 
the query body. They can be relational or comparisons. If the body contains no 
comparisons, then the query is relational. A query is linear if it does not contain 
two relational atoms with the same predicate symbol. We abbreviate a query as 
q{x) ^ B{s), where B{s) stands for the body and s for the terms occurring in 
the body. Similarly, we may write a conjunctive query as q{x) ^ Ris) & C(i), in 
case we want to distinguish between the relational atoms and the comparisons in 
the body, or, shortly, as q{x) ^ R &z C. The variables appearing in the head are 
called distinguished variables, while those appearing only in the body are called 
nondistinguished variables. Atoms containing at least one nondistinguished vari- 
able are called nondistinguished atoms. By abuse of notation, we will often refer 
to a query by its head q{x) or simply by the predicate of its head q. 

A database T) contains for every predicate symbol p a relation p^ , that is, a 
set of tuples. Under set semantics, a conjunctive query q defines a new relation 
q^ , which consists of all the answers that q produces over D. Under bag-set 
semantics, q defines a multiset or bag ^q^^ of tuples. The bag l^gS'^ contains 
the same tuples as the relation q^, but each tuple occurs as many times as it 
can be derived over V with q [|CV93|| . 



Under set-semantics, two queries q and q' are equivalent if for every database, 
they return the same set as a result. Analogously, we define equivalence under 
bag-set-semantics. 
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3.2 Aggregate Queries 

We now extend the Datalog syntax so as to capture also queries with GROUP BY 
and aggregation. We assume that queries have only one aggregate term. The 



general case can easily be reduced to this one [CNS99 . We are interested in 
queries with the aggregation functions count, sum, min and max. Since results 
for min are analogous to those for max, we do not consider min. Our function 
count is analogous to the function COUNT (*) of SQL. 

An aggregate term is an expression built up using variables, the operations 
addition and multiplication, and aggregation functions.]^ For example, count and 
sum{zi * Z2), are aggregate terms. We use k as abstract notations for aggregate 
terms. If we want to refer to the variables occurring in an aggregate term, we 
write where y is a tuple of distinct variables. Note that y is empty if k is 

the count aggregation function. Terms of the form count, sum{y) and max{y) are 
elementary aggregate terms. Abstractly, elementary aggregate terms are denoted 
as a{y), where a is an aggregation function. 

An aggregate term K{y) naturally gives rise to a function /^(i/) that maps 
multisets of tuples of numbers to numbers. For instance, sum{zi*Z2) describes the 
aggregation function fsum{zi*z2) that maps any multiset M of pairs of numbers 
{mi,m2) to X;(™i,™,)eA/'™i *"^2. 

An aggregate query is a conjunctive query augmented by an aggregate term 
in its head. Thus, it has the form q{x;K{y)) <— B(s). We call x the grouping 
variables of the query. Queries with elementary aggregate terms are elementary 
queries. If the aggregation term in the head of a query has the form a{y), we 
call the query an a-query (e.g., a max-query). In this paper we are interested in 
rewriting elementary queries using elementary views. However, as the example in 
Section ^ shows, even under this restriction the rewritings may not be elementary. 

We now give a formal definition of the semantics of aggregate queries. Con- 
sider the query q{x;K{y)) <— B{s). For a database V, the query yields a new 
relation q^. To define the relation q^ , we proceed in two steps. We associate to q 
a non-aggregate query, q, called the core of q, which is defined as q{x, y) ^ B{s). 
The core is the query that returns all the values that are amalgamated in the 
aggregate. Recall that under bag-set-semantics, the core returns over T> a bag 
of tuples {d, e). For a tuple of constants d of the same length as x, let 



r,:={{e\{d,e)e{{qr}}- 



That is, the bag is obtained by first grouping together those answers to q 
that return d for the grouping terms, and then stripping off from those answers 
the prefix d. In other words, / j is the multiset of y-values that q returns for d. 
The result of evaluating q over V is 

q-^ := {{d, e) I r,- ^ and e = f^^M)}. 



^ This definition blurs the distinction between the function as a mathematical object 
and the symbol denoting the function. However, a notation that takes this difference 
into account would be cumbersome. 
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Intuitively, whenever there is a nonempty group of answers with index d, then 
we apply the aggregation function /k(i/) to the multiset of y- values of that group. 

Again, two aggregate queries q and q' are equivalent if q^ = q'^ for all 
databases V. 

3.3 Equivalence Modulo a Set of Views 

Up until now, we have defined equivalence of aggregate queries and equivalence of 
non-aggregate queries under set and bag-set-semantics. However, the relationship 
between a query q and a rewriting r of q is not equivalence of queries, because 
the view predicates occurring in r are not regular database relations, but are 
determined by the base relations indirectly. In order to take this relationship 
into account, we define equivalence of queries modulo a set of views. 

We consider aggregate queries that use predicates both from TZ, a set of 
base relations, and V, a set of view definitions. For a database V, let 2?v be 
the database that extends V by interpreting every view predicate v G V as the 
relation v^. If g is a query that contains also predicates from V, then q^^ is the 
relation that results from evaluating q over the extended database I?v If Q, q' 
are two aggregate queries using predicates from TZUV, we define that q and q' 
are equivalent modulo V, written q =v q', if g'^'^ = q'^^ for all databases V. 

3.4 General Definition of Rewriting 

We give a general definition of rewritings. Let q be a query, V be a set of views 
over the set of relations TZ, and r be a query over V U TZ. All of q, r, and the 
views in V may be aggregate queries or not. Then we say that r is a rewriting of 
q using V if g =v r and r contains only atoms with predicates from V. If q =v i^ 
and r contains at least one atom with a predicate from V we say that r is a 
partial rewriting of q using V. 

Now we can reformulate the intuitive questions we asked in the end of the 
Section ||. 

— Rewriting Verification: Given queries q and r, and a set of views V, check 
whether q =v r. 

— Rewriting Computation: Given a query q and a set of views V, find all 
(some) rewritings or partial rewritings of q. 

4 Rewritings of Aggregate Queries 

We now present techniques for rewriting aggregate queries. Our approach will be 
to generalize the known techniques for conjunctive queries. Therefore, we first 
give a short review of the conjunctive case and then discuss in how far aggregates 
give rise to more complications. 
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4.1 Reminder: Rewritings of Relational Conjunctive Queries 

We review the questions related to rewriting relational conjunctive queries. Sup- 
pose, we are given a set of conjunctive queries V, the views, and another con- 
junctive query q. We want to know whether there is a rewriting of q using the 
views in V. 

The first question that arises is, what is the language for expressing rewrit- 
ings? Do we consider arbitrary first order formulas over the view predicates 
as candidates, or recursive queries, or do we restrict ourselves to conjunctive 
queries over the views? Since reasoning about queries in the first two languages 
is undecidable, researchers have only considered conjunctive rewritings.^ Thus, 
a candidate for rewriting q{x) has the form r{x) ^ vi{9ixi) & ... & Vn{9nXn), 
where the ^^'s are substitutions that instantiate the view predicates Vi{xi). 

The second question is whether we can reduce reasoning about the query r, 
which contains view predicates, to reasoning about a query that has only base 
predicates. To this end, we unfold r. That is, we replace each view atom Vi{9iXi), 
with the instantiation 9iBi of the body of Vi, where Vi is defined as Vi{xi) <— Bi. 
We assume that the nondistinguished variables in different occurrences of the 
bodies are distinct. We thus obtain the unfolding r" of r, for which the Unfolding 
Theorem holds, r^ix) ^ OiBi k ... k 6'„B„. 

Theorem 1 (Unfolding Theorem). Let V be a set of views, r a query over 
V , and r" he the unfolding of r. Then r and r" are equivalent modulo V, that is, 

r =v r". 



The third question is how to check whether r is a rewriting of q, that is, 
whether r and q are equivalent modulo V. This can be achieved by checking 
whether r" and q are set-equivalent: if r" = q, then the Unfolding Theorem im- 
plies r =v q. Set-equivalence of conjunctive queries can be decided syntactically 
by checking whether there are homomorphisms in both directions [U1189|. 



4.2 Rewritings of Count-queries 

When rewriting count-queries, we must deal with the same questions that arose 
when rewriting conjunctive queries. Thus, we first define the language for ex- 
pressing rewritings. Even if we restrict the language to conjunctive aggregate 
queries over the views, we still must decide on two additional issues. First, 
which types of aggregate views are useful for a rewriting? Second, what will 
be the aggregation term in the head of the rewriting? A count-query is sensitive 
to multiplicities, and count-views are the only type of aggregate views that do 

^ It is an interesting theoretical question, which as yet has not been resolved, whether 
more expressive languages give more possibilities for rewritings. It is easy to show, 
at least, that in the case at hand allowing also disjunctions of conjunctive queries as 
candidates does not give more possibilities than allowing only conjunctive queries. 
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not lose multiplicities]^ Thus, the natural answer to the first question is to use 
only count- views when rewriting count-queries. We show in the following exam- 
ple that there are an infinite number of aggregate terms that can be usable in 
rewriting a count-query. 

Example 1. Consider the query 

q_positions_per_type(j; count) <— ta(7i, c, j) 

in which we compute the number of positions of each type held in the uni- 
versity. Recall the view v_positions_per_type defined in Section |^. It is easy 
to see that both of the following are rewritings of q_positions_per_type: 

ri(j';z) ^ v_positions_per_type(j''; z) 
r2(j'; ^/zl'*~Z2) ^ v_positions_per_type(j''; zi) & v_positions_per_type(j'; Z2). 

By adding additional view atoms and adjusting the power of the root we 
can create infinitely many different rewritings of q_positions_per_type. It is 
natural to create only ri as a rewriting of q. In fact, only for ri will the Unfolding 
Theorem hold. 

We define a candidate for a rewriting of q{x] count) ^ R h C as a query 
having the form 

n 

r{x] sum{Y\_Zi)) ^ vl{9iXi;zi) & ... & w^(6'„x„; z„) & C", 
1=1 

where v'l are count-views, possibly with comparisons, defined as (ij; count) «— 
Bi and Zi are variables not appearing elsewhere in the body of r. We call r a 
count-rewriting candidate. 

Note that it is possible to omit the summation if the values of Zi are func- 
tionally dependent on the value of the grouping variables x. This is the case, if 
only grouping variables appear as OiXi in the heads of the instantiated views. 
Then the summation is always over a singleton group. 

After presenting our rewriting candidates we now show how we can reduce 
reasoning about rewriting candidates, to reasoning about conjunctive aggregate 



queries. We use a similar technique to that shown in Subsection 4.1. In the 
unfolding, we replace the view atoms of the rewriting with the appropriate in- 
stantiations of their bodies, and we replace the aggregate term in the rewriting 
with the term count. Thus, we obtain as the unfolding r" of r the query 

r"(x; count) ^ OiBi & ... & 6I„B„ & C . 



^ Although sum- views are sensitive to multiplicities (i.e., are calculated under bag- 
set-semantics), they lose these values. For example, sum- views ignore occurrences of 
zero values. 



10 



S.Cohen, W.Nutt, A.Serebrenik 



In |CNS99|, it has been proven that for r" the Unfolding Theorem holds, i.e., 
r =v ■ Moreover, it has been shown that this definition of unfolding uniquely 
determines the aggregation function in the head of our candidates. That is, 
summation over products of counts is the only aggregation function for which 
the Unfolding Theorem holds if r" is defined as above. Now, in order to verify 
that r is a rewriting of q, we can check that r" is equivalent to r, without taking 
into account the views any more. 

We now present an algorithm that finds a rewriting for a count-query using 
views. Our approach can be thought of as reverse engineering. We have charac- 
terized the "product" that we must create, i.e., a rewriting, and we now present 
an automatic technique for producing it. 



In |NSS98|, a sound and complete characterization of equivalence of con- 
junctive count-queries with comparisons has been given. The only known algo- 
rithm that checks equivalence of conjunctive count-queries creates an exponen- 
tial blowup of the queries. Thus, it is difficult to present a tractable algorithm 
for computing rewritings. However, it has been shown CV93| , NSS9£ | that two 



relational count-queries are equivalent if and only if they are isomorphic. In addi- 
tion, equivalence of linear count-queries with comparisons is isomorphism of the 



queries |NSS98[ . Thus, we will give a sound, complete, and tractable algorithm 
for computing rewritings of relational count-queries and of linear count-queries. 
This algorithm is sound and tractable for the general case, but is not complete. 

We discuss when a view count) ^ Ry k Cy, instantiated by 9, is usable 
in order to rewrite a query q{x; count) ^ R C , that is, when the instantiated 
view can occur in a partial rewriting. By the characterization of equivalence for 
relational and linear queries, a rewriting of q is a query r that when unfolded 
yields a query isomorphic to q. Thus, in order for 9v, to be usable, ORy must 
"cover" some part of R. Therefore, 9v is usable for rewriting q only if there exists 
an isomorphism, 93, from 6Ry to R' C R. Note that we can assume, w.l.o.g. that 
if is the identity mapping on the distinguished variables of v. We would like to 
replace R' with 9v in the body of q in order to derive a partial rewriting of q. 
This cannot always be done. Observe that after replacing R' with 9v^ variables 
that appeared in R' and do not appear in 9u (i.e., the nondistinguished variables 
in v) are not accessible anymore. Thus, we can only perform the replacement if 
these variables do not appear anywhere else in g, in q's head or body. We capture 
this property by defining that v{u] count) <— Ry & Cy is R-usahle under 9 w.r.t. 
ip if 

1. (f9Ry is isomorphic to a subset R' of R, and 

2. every variable that occurs both in R' and in R \ R' must occur in 9u. 

We denote this fact as i?-usable(t', 9, ip). Clearly, there is a partial rewriting using 
V of q{x; count) ^ R &c C only if count) <— Ry & Cy is _R-usable under 9 
w.r.t. some ip. 

Example 2. Consider the following query that computes the number of sponsors 
for each assistant in the database course 

q_db_ta_sponsors(n; count) ^ ta(n. Database, j) & salaries(j, s, a). 
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We suppose that we have a materiahzed view that computes the number of 
jobs that each teaching assistant has in each course that he assists 

v_jobs_per_ta(n', c'; count) ^ ta(n', c', j'). 

In order to use v_jobs_per_ta in rewriting q_db_ta_sponsors we must find 
an instantiation 9 such that 9ta{n' , c' , j') covers some part of the body of 
q_db_ta_sponsors. Clearly, 9ta(n' , c' , j') can cover only ta(n, Database, j). We 
take, 9 = {n'/n, c'/Database} and thus, ip — {n/n, j' /j}. However, j appears 
in ta(ri. Database, j) and not in the head of ^v_jobs_per_ta and therefore, j 
is not accessible after replacement. Note that j appears in salaries and thus, 
v_jobs_per_ta is not i?-usable in rewriting q_db_ta_sponsors. 

For our algorithm to be complete for linear queries, the set of comparisons 
in the query to be rewritten has to be deductively closed (see Example 
The deductive closure of a set of comparisons can be computed in polynomial 



time |Klu8S|. In addition, it must hold that C |= ip{9Cv), thus, the comparisons 
inherited from v are weaker than those in q. For a rewriting using 9v to exist 
it must be possible to strengthen (p{9Cv) by additional comparisons C" so that 
(f{9Cv) & C is equivalent to C. We have seen that when replacing R' with 9v 
we lose access to the nondistinguished variables in v. Therefore, it is necessary 
for the comparisons in ip{9Cy) to imply all the comparisons in q that contain 
an image of a nondistinguished variable in v. Formally, let ndv(v) be the set of 
nondistinguished variables in v. Let C'^iendviv)) consist of those comparisons m 
C that contain variables in ip{9ndv{v)). Then, in order for 9v to be usable in a 
partial rewriting, \= C^(>^'"^Hv)) must hold. If this condition and C \= ip{9C^) 
hold, then we say that v is C-usable under 9 w.r.t. Lp and write C-usable(v, 9, 

We summarize the discussion in a theorem. 

Theorem 2. Let q{x\ count) ^ R C he a count-query whose set of compar- 
isons C is deductively closed, and let v{u; count) Ry &l Cy be a count-view. 
There exists a partial rewriting of q using v if and only if there is a Lp such that 
i?-usable(w, 9, (p) and C-usable(w, 9, ip). 

Example 3. The following query computes for each job the number of mediocre 
sponsors, i.e., the number of sponsors who give an amount that is greater than 
$200 and less than $600. 

q_mediocre_sponsor(j'; count) ^ salaries(j, s, a) & a > 200 & a < 600. 

The view 

v_all_sponsor(j'; count) ^ salaries(j', s', a') & a' > 



computes the number of sponsors for each job. In order to use v_all_sponsor 
in rewriting q_mediocre_sponsor we clearly must take 9 = {j'/j} and (p = 
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{j/j,s'/s,a'/a}. It holds that {a > 200 & a < 600} h W^io-' > 0)}- Ob- 
serve that a' is a nondistinguished variable in v_all_sponsor and a' is mapped 
to a by LpO. Thus, in order for v_all_sponsor to be C-usable for rewriting 
q_mediocre_sponsor it must hold that {ip9{a' > 0)} \= {a > 200 & a < 600}. 
This does not hold. Therefore, v_all_sponsor is not C-usable for rewriting 
q_mediocre_sponsor. 

We present an algorithm for computing rewritings of conjunctive count- 
queries in Figure ^. The underlying idea is to incrementally cover the body 
of the query by views until no atom is left to be covered. The algorithm non- 
deterministically chooses a view v and an instantiation 0, such that v is both 
i?-usable and C-usable under 6. If the choice fails, backtracking is performed. 

When the while-loop is completed, the algorithm returns a rewriting. By 
backtracking we can find additional rewritings. Of course, the nondeterminism in 
choosing the views can be further reduced, for instance, by imposing an ordering 
on the atoms in the body of the query and by trying to cover the atoms according 
to that ordering. Note, that the same algorithm may be used to produce partial 
rewritings if we relax the termination condition of the while-loop. This will 
similarly hold for subsequent algorithms presented. 

We note the following. In Line 9, R is changed and thus, q is also changed. 
Therefore, at the next iteration of the while-loop we check whether v is i?-usable 
under 9 to rewrite the updated version of q (Line 6). Thus, in each iteration 
of the loop, additional atoms are covered. In Line 10, the algorithm checks if a 
nondistinguished atom is already covered. If so, then the algorithm must fail, 
i.e., backtrack, as explained above. 

Observe that we modify C in Line 12. We remove from C its comparisons 
containing a variable that is not accessible after replacing the appropriate subset 
of R by the appropriate instantiation of v. These comparisons are not lost be- 
cause V is C-usable. The comparisons remaining in C are needed to strengthen 
those inherited from the views such that they are equivalent to the comparisons 
in the query to be rewritten. 

Count_Rewriting is both sound and complete for linear queries and relational 
queries and is sound, but not complete, for arbitrary queries. Our algorithm 
runs in nondeterministic polynomial time by guessing views and instantiations 
and verifying in polynomial time that the obtained result is a rewriting. For 
relational queries this is optimal, since checking whether there exists a 9 such 
that V is i?-usable under 9 is NP-hard, which can be shown by a reduction of 
the graph matching problem. Since for linear queries q and views v the existence 
of 6 and (p such that _R-usable(u, 6, (p) and C-usable(i), 9, (p) can be decided in 
polynomial time, one can obtain a polynomial time variant of the algorithm that 
computes partial rewritings in the linear case. 

Theorem 3. (Soundness and Completeness of Count Rewriting) Let q 

be a count-query andV be a set of views. Ifr is returned by Count_Rewriting((7, V), 
then r is a count-rewriting candidate of q and r =v q- If q is either linear or 
relational, then the opposite holds by making the appropriate choices. 
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Algorithm Count-Rewriting 




Input 


A query q{x; count) ^ R C and a set of views V 




Output A rewriting r of q. 






Not-Covered := R. 




(2) 


Rewriting := 0. 




(3) 


n := 0. 




(4) 


While Not.Covered ^ do: 






Choose a view v{x'; count) R' k. C in V. 




(6) 


Choose an instantiation, 0, and an isomorphism 99, 






such that i?-usable(u, 9, tp) and C-usable(t;, 




(7) 


For each atom a £ R' do: 




(8) 


If a is a nondistinguished atom, then 




(9) 


Remove (p{6a) from R. 




(10) 


If ip{9a) ^ Not.Covcrcd then fail. 




(11) 


Remove (p{6a) from Not_Covcrcd. 




(12) 


Remove from C comparisons containing a variable 


in if{9R'), 




but not in Ox' 




(13) 


Increment n. 




(14) 


Add v{9x'; Zn)) to Rewriting, where z„ is a fresh variable. 


(15) 


Return r{x; sum(n"=i ^i)) <— Rewriting & C. 





Fig. 1. Count Query Rewriting Algorithm 



Example 4- This example shows the incompleteness of the algorithm if the com- 
parisons in the query being rewritten are not deductively closed. Consider the 
following query q, and views vi and V2, defined as 

q{count) ^ Pi{x) & P2{y) k x < y &l y <2 k. 

P3{u) & P4{'w) k. u < w k. w <2 
v\{x,u; count) pi{x) k P2{y) k x < y k y <2 k ps{u) ku <2 
V2{x,u; count) <— P3{u) k P4{w) k u < w k w <2 k pi{x) kx <2. 

Note that the comparisons in q are not deductively closed since q does not contain 
x < 2 and u < 2. The algorithm Count_Rewriting will not find any rcwritings 
of q using vi and V2- We can understand this in the following way. Suppose 
that the view v\ is chosen first. Clearly, vi can be used for a rewriting taking 
the instantiation 9 and the isomorphism to be the identity mappings. The 
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algorithm Count_Rewriting removes from q the comparisons x < y and y < 2 since 
they contain the variable y which is an image of the nondistinguished variable 
y in vi. However, Count_Rewriting can no longer use V2 in the rewriting since 
the constraints in q no longer imply the constraint a; < 2 in W2- For symmetric 
reasons, Count_Rewriting would also fail to find a rewriting if V2 was chosen first. 
However, clearly the following is a rewriting of q using vi and V2- 

r{sum{zi * Z2)) ^ u; zi) & V2{x, u; Z2). 



Example 5. The algorithm is incomplete for the general case. Consider the fol- 
lowing query q, and view v 

q{; count) ^ p{x) & p{y) & p{u) Szx<ySzx<u 
v{; count) ^ p{x') & p{y') & p{u') kx' <y' ku' <y' 



Clearly q and v are equivalent | NSS9§| . However, for all homomorphisms 



from V to q, it holds that {x < y k, x < u} {^{x' < y') & ip{u' < y')}. Thus, 
V is not C-usable for rewriting q and the algorithm docs not find any rewritings. 



4.3 Rewritings of Sum-Queries 

Rewriting sum-queries is similar to rewriting count-queries. When rewriting sum- 
queries we must also take the summation variable into consideration. We present 
an algorithm for rewriting sum-queries that is based on the algorithm for count- 
queries. 

We define the form of rewriting candidates for sum-queries. Since sum and 
count- views are the only views that are sensitive to multiplicities, they are useful 
for rewritings. However, sum-views may lose multiplicities and make the aggre- 
gation variable inaccessible. Thus, at most one sum-view should be used in the 
rewriting of a query. The following are rewriting candidates for sum-queries: 

n 

ri (x; sum(y * JJzi)) ^ ^^(eixi; zi) & ... k v'l^{9nXn; Zn) k C (1) 

n 

r2{x; sum{y *Y\_^i)) ^ '"''i^sXs^y) k vlieixi; zi) k ... & <(6'„x„; z„) & C(2) 



where v'^ is a count-view of the form (x^; count) ^ Bi and is a sum-view of 
the form v'^{xs', sum{y)) <~ Bg. Note that the variable y in the head of the query 
in Equation |l| must appear among 9iXi for some i. In |CNS99|, it has been shown 
that if a rewriting candidate is equivalent modulo the views to its unfolding then 
it must be one of the above forms. As in the case of count-query rewritings, in 
some cases the rewriting may be optimized by dropping the summation. 

Once again, we reduce reasoning about rewriting candidates to reasoning 
about conjunctive aggregate queries. For this purpose we extend the unfolding 
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technique introduced in Subsection 4.2. Thus, the unfoldings of the candidates 
presented are: 

r"(x;sum(y)) ^ 0iBi & ... & 6l„B„ & C". 

r!^{x; sum{y)) <- 0sBs & OiBi & ... & 6l„S„ & C". 

Now, instead of checking whether r is a rewriting of q we can verify whether 
r" is equivalent to r. T he only known algorithm for checking equivalence of sum- 



queries, presented in [NSS9S], requires an exponential blowup of the queries. 
However, relational sum-queries and linear sum-queries are equivalent if and 
only if they are isomorphic. Thus, we can extend the algorithm presented in the 
Figure ^ for sum-queries. 

We first extend the algorithm in Figure 0, such that in Line 5 sum-views 
may be chosen as well. We call this algorithm Compute_Rewriting. We derive an 
algorithm for rewriting sum-queries, presented in Figure |^. The algorithm runs 
in nondeterministic polynomial time. 



Algorithm Sum_Rewriting 

Input A query q{x; sum(y)) <— B and a set of views V 

Output A rewriting r of q. 



(1) Let q' be the query q'{x; count) ^ B. 

(2) Let r'=Compute_Rewriting(q', V). 

(3) If r' is of the form 

r'{x; sum{y * HiLi ^i)) ^ v^OsX^; y) & uj(6'ixi; zi) & ... & 

)kC' 

(4) Then return r' 

(5) If r' is of the form 

r'{x\sum{Y\^^^Zi)) ^ vl{9ixi;zi) k ... & u^(0„S„; z„) & C" 
and y appears among 9iXi 

(6) Then return 

r{x;sum{y *]Xl^-^Zi)) ^ vl{9ixi,zi) k ... k v^^{9nXn, Zn) k C . 



Fig. 2. Sum Query Rewriting Algorithm 



Theorem 4. (Soundness and Completeness of Sum Rewriting) Let q be 

a sum-query and V be a set of views. If r is returned by Sum_Rewriting(q, V), 
then r is a sum-rewriting candidate of q and r =v q. If q is linear or relational, 
then the opposite holds by making the appropriate choices. 
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4.4 Rewritings of Max-Queries 

We consider tlie problem of rewriting max-queries. Note tliat max-queries are 
insensitive to multiplicities. Thus, we use nonaggregate views and max-views 
when rewriting a max-query. When using a max- view the aggregation variable 
becomes inaccessible. Thus, we use at most one max-view. The following are 
rewriting candidates of the query q: 

ri{x;max{y)) ^ vi{dixi) ... & w„(0„x„) & C" (3) 
r2{x;max{y)) ^ v"\e,nXm]y) & & ... & f„(6'„x„) & C (4) 

Note that the ViS are nonaggregate views and that is a max-view. The 
variable y in the head of the query in Equation y must appear among OiXi for 



some i. In [ CNS99 | it has been shown that if a rewriting candidate is equivalent 
to its unfolding then it must have one of the above forms. 

Reasoning about rewriting candidates can be reduced to reasonin g about 



max-queries, by extending the unfolding technique. It has been shown ||NSS98 | 
that equivalence of relational max-queries is equivalence of their cores. There is a 
similar reduction for the general case. Algorithms for checking set-equivalence of 
queries can easily be converted to algorithms for checking equivalence of max- 
queries. Thus, algorithms that find rewritings of nonaggregate queries can be 
modified to find rewritings of max-queries. 



Rewriting nonaggregate queries is a well known problem [LMSS95|. One well 



known algorithm for computing rewritings of queries is the buckets algorithm 



[ [LR096b| , |LR096a[ . Consider a query q{x) ^ R k C. The algorithm creates a 
"bucket" for each atom p{z) in R. Intuitively, this bucket contains all the views 
whose bodies can cover p(z). The algorithm places into this bucket all the views 
v{y) <— Ry such that Ry contains an atom p{'w) that can be mapped by 

some mapping (p to p{z) such that C & LpC is consistent. Next, all combinations 
of taking a view from each bucket are considered in the attempt to form a 
rewriting. 

Note that by reasoning similarly as in the case of count and sum-queries, 
we can reduce the number of views put into each bucket, thus improving on 
the performance of the algorithm. Suppose there is a nondistinguished variable 
w G w mapped to z € z and there is an atom containing z in q that is not 
covered by ipRv In such a case, if v is used in a rewriting candidate there will 
not exist a homomorphism from the unfolded rewriting to q such that the body 
of V covers p{z). However, a rewriting candidate r is equivalent to a query q if 
and only if there exist homomorphisms from r" to g and from q to r". Thus, v 
should not be put in the bucket of p{z). 

Observe that this condition is a relaxed version of the i?- usability requirement 
that ensures the existence of an isomorphism. Clearly this restriction filters out 
the possible rewritings of g, thereby improving the performance of the buckets 
algorithm. Thus, our methods for finding rewritings of aggregate queries may 
be relaxed and used to improve the performance of algorithms for rewriting 
relational queries. These, in turn, may be modified to rewrite max-queries. 
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5 Conclusion 



Aggregate queries are increasingly prevalent due to the widespread use of data 
warehousing and related applications. They are generally computationally ex- 
pensive since they scan many data items, while returning few results. Thus, the 
computation time of aggregate queries is generally orders of magnitude larger 
than the result size of the query. This makes query optimization a necessity. 

Optimizing ag gregate queries using views has been studied for the special 
case of datacubes iHRU96| , |Dyr96| . However, there was little th eory for genera l 
aggregate queries. In this paper, based on previous results in [ NSS98 , CNS99 1 , 
we presented algorithms that enable reuse of precomputed queries in answering 
new ones. The algorithms presented have been implemented in SICStus Prolog. 
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A Translating SQL to Datalog 

In this paper we extended the well-known Datalog syntax for non-aggregate 



queries |U1189|] so that it covers also aggregates. This syntax is more abstract and 
concise than SQL. It is not only better suited for a theoretical investigation, but 
it is also a better basis for implementing algorithms that reason about queries, 
in particular for implementations in a logic programming language. 

Through the syntax we implicitly define the set of SQL queries to which our 
techniques apply. They are essentially nonnested queries without a HAVING clause 
and with the aggregate operators min, max, count, and sum. In this section we 
demonstrate, using examples, how an SQL query of this type can be transformed 
into one in our extended Datalog notation. 

We first show how to transform an SQL query without aggregation into one 
in Datalog notation. Consider a query with SELECT, FROM, and WHERE clauses. 
For each relation name in the FROM clause we introduce a predicate name, and 
for each attribute of a relation, we fix an argument position of the corresponding 
predicate. For each occurrence of a relation name in the FROM clause we create 
a relational atom. The selection constraints in the WHERE clause are taken into 
account by placing constants or identical variables into appropriate argument 
positions of the atoms corresponding to a relation, or by imposing comparisons 
on variables. Finally, the output arguments in the SELECT clause appear as the 
distinguished variables in the head. 

We demonstrate the translation using an example. This example can easily 
be generalized to arbitrary SQL queries without GROUP-BY and HAVING clauses. 
Consider a query that finds the teaching assistants who have a job for which 
they receive more then $500 from the government: 



SELECT name 
FROM ta, salaries 

WHERE sponsorship = 'Govt.' AND amount > 500 
AND ta.job_type = salaries.job_type. 

We translate this query into an equivalent Datalog query with the head pred- 
icate q_govt. For the relation names ta and salaries we introduce the predicate 
names ta and salaries. In the fashion described above, we derive the following 
equivalent Datalog query: 

q_govt(n) ^ salaries(j'. Govt., a) & ta(n,c,j) & a > 500. 

In this paper we extended the Datalog syntax so as to capture also queries 
with GROUP BY and aggregation. Using our notation, we can represent SQL 
queries where the group by attributes are identical to those in the SELECT state- 
ment, although SQL only requires that the latter be a subset of those appearing 
in the GROUP BY clause. Also, we assume that queries have only one aggregate 
term. The general case can easily be reduced to this one. 
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The extension of the Datalog syntax is straightforward. Since the SELECT 
attributes are identical to the grouping attributes, there is no need to single 
them out by a special notation. Hence, the only new feature is the aggregate 
term in the SELECT clause. We simply add it to the terms in the head of the 
query, after replacing the attributes with corresponding variables. 

To demonstrate this translation, recall the query in Section |^ that calculates 
the total amount of money spent on each job type. The following SQL query can 
be transformed into the previously mentioned Datalog query: 

SELECT ta.job_type, sum(amount) 

FROM ta, salaries 

WHERE ta.job_type = salaries.job_type. 

We have demonstrated how to translate SQL into Datalog. Obviously, the 
translation from Datalog to SQL can be performed in a similar fashion. Roughly 
speaking, we replace predicates with relations and variables with attributes. The 
variables in the head of the Datalog query become attributes in the SELECT clause 
of the SQL query and the comparisons are placed in the WHERE clause. Hence, 
one notation can be transformed into the other, back and forth, completely 
automatically. 



